Project Introduction

Project Description

Project 2 in IA650 involves analyzing electric load across the state of New York. The New York Independent System Operator (NYISO) manages the flow of reliable electricity across the state. NYISO is responsible for cost-effectively matching offers from energy producers with consumer utility demand to supply electric power for the state, and it also oversees the delivery of power from generators to the utility companies.  * Learn more 

Electric load is forecasted and monitored hourly across eleven NYISO zones. The forecasted load values are heavily used in NYISO's day-to-day operations, such as balancing power and demand response.  * NYISO Zonal Map 

The results provided in this report will be reproducible by an analyst evaluating or extending the report findings. Each assumption made and step taken in this report will be carefully described. When applicable, relevant web links for additional reading and supporting points will be provided. All datasets, code, and logic will be provided and thoroughly explained. All answers provided will follow the CRISP-DM framework. The final report is generated in RStudio by editing an .RMD file and knitting it to an .HTML file.

All supporting files will be provided in this project submission, including the knitted .html file, the .Rmd code, the .xlsx files used, etc. 

Project configuration settings

The project report will start by performing a few housekeeping items:  * Loading libraries used  * Aligning various configuration settings 

# Libraries to be used in the project report
library(broom)
library(chron)
library(corrplot)
library(e1071)
library(forecast)
library(fpp2)
library(ggpubr)
library(ggthemes)
library(janitor)
library(kableExtra)
library(knitr)
library(lubridate)
library(plotly)
library(psych)
library(qrsvm)
library(RColorBrewer)
library(readxl)
library(skimr)
library(stats)
library(summarytools)
library(tibbletime)
library(tidyverse)
library(vcd)

User-Defined functions

In this section of the introduction, one user defined function is declared.

  • modeUDF supporting link | modeUDF credit 
    • Purpose: to determine mode value of a factor variable 
# Declare modeUDF
modeUDF <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]}
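
As a quick illustration (the values here are invented for demonstration), the function returns the most frequent value, with ties resolved in favor of the value seen first:

```r
# The most frequent level wins; "MON" appears twice here
modeUDF(c("MON", "TUES", "MON", "SAT"))   # returns "MON"
```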

Business Understanding

In this section of the project report, a sense of business understanding will be developed and explained with respect to the problem at hand with electric load forecasting. The business understanding framework is an important exercise because it helps to align the analyst with background information and real-world semantics with regard to the data frame(s) being analyzed.

Determine Business Objectives

In the electricity distribution market, pricing decisions for customer energy consumption are influenced by many factors. Some examples of influential factors are the cost of energy distribution, fuel, and power plant operation. The United States federal government allows each state legislature to decide whether customer electricity consumption costs are regulated, separately from the other costs businesses incur in the process.

The predicted electric load amount is a crucial input to the electricity pricing model; it is one of the main factors in pricing decisions. Because of this strong influence on pricing, load forecasting is at the core of nearly all decisions made in energy markets. Considering how important pricing decisions are to business health, it would prove helpful to better understand how the current electric load forecasting model is performing, and to explore whether it can be improved upon.

Accurate load forecasting can help refine many business processes, such as contract evaluation, energy purchasing and generation, load switching, and infrastructure development. Under-estimation of an electric load forecast by a supplier may lead to higher operational costs, because quick-start units cost more to generate power. Over-estimation may lead to uneconomic operation of large generation units, leaving the company with unused purchased electric energy. It has been found that a load forecast error of 1% in terms of Mean Absolute Percentage Error (MAPE) can translate into several hundred thousand dollars on a utility's financial bottom line. This statistic emphasizes that businesses in the energy market need accurate load forecasting.

This report will focus on the electricity loads in the state of New York, a state with regulated electricity pricing. Electricity market decisions are made by a collection of independent system operators. Our study focuses on New York Independent System Operator (NYISO) load data available at the data center weblink provided in the Data Understanding section of this report. NYISO currently performs point forecasting of load using regression and artificial neural network methods. We will use a different approach to forecast the load and compare the results with NYISO's.

supporting link | Factors Affecting Electricity Prices

Assess Situation

Identification of Stakeholders

The stakeholders of the process are: 
* Each Independent System Operator (ISO)
+ This report focuses on NYISO
* Local distribution companies (LDC)
+ e.g., National Grid
* Electric generation stations
* Customers of electric distribution companies

Interests of the Identified Stakeholders

Ensuring a forecast is accurate, so as to align spending and resource management.
* The more accurate the forecast, the less money an ISO will lose by buying too much energy and not being able to use it, or by not buying enough energy and having to go back and buy more at a higher price.
* For the LDCs, a more accurate model means they will be able to better predict the price for the customer and better align their business operations, ensuring higher profit.
* Generation stations benefit from a more accurate load prediction model as well, since they will be able to better predict which electric generation stations are required to be running.
* Customers will benefit from a better model as the price paid for electricity will be more consistent, avoiding the higher prices that come with under-estimation in the load forecast.

Data Mining Goals

The purpose of the data mining methods used in this report is to re-simulate the forecasting of electric load values across the NYISO service area. After running the new forecasting models, the report will check the performance of the forecasted values against the original forecasts used by NYISO.

Project Plan

As suggested by Hong (2010), this project report will use three years of actual electric load data (Jan 1, 2015 to Dec 31, 2017) as training data to build a model that forecasts the next day's electric load. Following model generation, the report will use a rolling forecast over 2018 (Jan 1, 2018 to Nov 15, 2018): for each day, the model is recalculated on the actual data up to that point and used to forecast the next day.

This report will use two forecasting methods to produce possible electric load values:
1. Linear Regression Model
2. ARIMA Model


Data Understanding

In this section of the project report, the data understanding topics from the CRISP-DM framework are explored.

Collect Initial Data

Three sets of raw data were collected from the NYISO data load website:
* Actual Electric Load data
+ Under the Actual Load section, select the Real-Time dropdown
+ Select the option to generate a custom report
+ Select all available Zones
+ Select a start date of 01/01/2015 and an end date of 11/15/2018
+ Select to download in .CSV format
* Forecast Electric Load data
+ Under the Load Forecast section, select the NY Load Forecast dropdown
+ Select the option to generate a custom report
+ Select all available Zones
+ Select a start date of 01/01/2015 and an end date of 11/15/2018
+ Select to download in .CSV format
* Weather data
+ Under the Load Forecast section, select the Load Forecast Weather Data dropdown
+ Select the option to view the archive
+ Select Archived Files (zip format) for the months of 01-2015 thru 11-2018
+ Unzip each monthly archive and store all .CSV files in a central folder

data download link | NYISO Data Center

Each set of raw data downloaded is initially stored as a collection of .CSV files. There is a process to merge multiple .CSV files into a central .CSV file, using either the command line (Windows/Linux) or the terminal (MacOS). Here is documentation on how to merge the .CSV files on both platforms:
* Windows/Linux | How to merge multiple .CSV files using Windows/Linux
* MacOS | How to merge multiple .CSV files using MacOS
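
For analysts who prefer to stay inside R, the same merge can be sketched with base functions (a hedged alternative to the linked command-line approaches; the folder path and file pattern here are illustrative, not the ones used for this report):

```r
# Read every monthly .CSV in a folder and stack the rows into one frame,
# assuming all files share identical column headers
csv.files <- list.files("csv_folder", pattern = "\\.csv$", full.names = TRUE)
merged <- do.call(rbind, lapply(csv.files, read.csv))
write.csv(merged, "merged.csv", row.names = FALSE)
```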

Once the data are in three combined .CSV files, the files need to be merged and converted to .XLSX files. The two merged load data .CSV files (actual and forecasted load) are combined into a single .CSV and then saved as an .XLSX file, nys_electric_load.xlsx. The merged weather data is saved as an .XLSX file, nys_weather_data.xlsx.
* Each conversion from .CSV to .XLSX is done using Microsoft Excel
* Please Note: The column header names are changed to be more "code friendly"
+ See what the column headers are in the Describe Data section that follows, and change the headers in the raw .XLSX sheets to match

# Upload raw electric load and forecast data from .xlsx file
load.init <- read_excel("nys_electric_load.xlsx")

# Upload raw weather forecast data from .xlsx file
wthr.init <- read_excel("nys_weather_data.xlsx")

Describe Data

After the raw data frames are loaded, the structure of each data frame is inspected to gain an understanding of the various types of variables amongst the raw data.

# Structure of raw load data frame
str(load.init)
## Classes 'tbl_df', 'tbl' and 'data.frame':    373560 obs. of  9 variables:
##  $ date_hour    : POSIXct, format: "2015-01-01 00:00:00" "2015-01-01 00:00:00" ...
##  $ gmt_hour     : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ day          : chr  "THUR" "THUR" "THUR" "THUR" ...
##  $ zone         : chr  "CAPITL" "CENTRL" "DUNWOD" "GENESE" ...
##  $ zone_id      : num  61757 61754 61760 61753 61758 ...
##  $ load_actual  : num  1281 1814 644 1042 1070 ...
##  $ load_forecast: num  1264 1651 645 973 1029 ...
##  $ rmse         : num  16.8 162.7 1.2 69.4 40.8 ...
##  $ mape         : num  1.329 9.855 0.186 7.133 3.965 ...

Initial Load Data | load.init data frame described, with comments about significant observations
* date_hour, POSIXct | Time stamp of each observation in yyyy-mm-dd hh:mm:ss format
* gmt_hour, num | Greenwich Mean Time (GMT) hour of each observation
+ This attribute is redundant, as it is not necessary to encode the hourly time stamp in two attributes
* day, chr | Day of week on which the observation was made
+ This chr variable will need to be formatted as a factor variable
* zone, chr | NYISO zone in which the observation was made, by name
+ This chr variable will need to be formatted as a factor variable
* zone_id, num | NYISO zone in which the observation was made, by ID
+ This attribute is redundant, as it encodes the same information as zone
* load_actual, num | Actual hourly electric load value observed
* load_forecast, num | Forecasted hourly electric load value
* rmse, num | Root Mean Squared Error (RMSE) measure of the observation
+ The RMSE value was calculated from rows existing in the raw data
* mape, num | Mean Absolute Percentage Error (MAPE) measure of the observation
+ The MAPE value was calculated from rows existing in the raw data

The formulas for MAPE and RMSE are well-known, but can be explicitly found in Yang et al. (2018).
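
For completeness, both measures are straightforward to express in R (a minimal sketch; the function names are this report's own, not from a package):

```r
# RMSE: square root of the mean squared forecast error
rmseUDF <- function(actual, forecast) sqrt(mean((actual - forecast)^2))

# MAPE: mean absolute error as a percentage of the actual value
mapeUDF <- function(actual, forecast) mean(abs((actual - forecast) / actual)) * 100
```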

# Structure of raw weather data frame
str(wthr.init)
## Classes 'tbl_df', 'tbl' and 'data.frame':    27212 obs. of  10 variables:
##  $ forecast_date: POSIXct, format: "2015-01-01" "2015-01-01" ...
##  $ date         : POSIXct, format: "2014-12-31" "2014-12-31" ...
##  $ station_id   : chr  "ALB" "ART" "BGM" "BUF" ...
##  $ max_temp     : num  26 25 20 22 25 30 32 34 31 21 ...
##  $ min_temp     : num  15 8 13 16 11 21 23 28 27 10 ...
##  $ temp_diff    : num  11 17 7 6 14 9 9 6 4 11 ...
##  $ max_wet_bulb : num  21 22 16 21 20 24 25 26 25 18 ...
##  $ min_wet_bulb : num  13 7 11 13 10 18 20 23 22 9 ...
##  $ diff_wet_bulb: num  8 15 5 8 10 6 5 3 3 9 ...
##  $ zone         : chr  "CAPITL" "MHK VL" "MHK VL" "WEST" ...

Initial Weather Data | wthr.init data frame described, with comments about significant observations
* forecast_date, POSIXct | Date on which a forecast was made for the observation
+ This variable is not important to this analysis
+ This project is more focused on the true date of the observation
* date, POSIXct | Date of the weather information for the observation
* station_id, chr | Weather station at which the observation was made
+ Each NYISO zone may contain many weather stations
+ The weather station was used to determine which NYISO zone the observation pertains to
* max_temp, num | Maximum temperature (in degrees Fahrenheit) reached on the observation date
* min_temp, num | Minimum temperature (in degrees Fahrenheit) reached on the observation date
* temp_diff, num | Difference between the maximum and minimum temperatures reached on the observation date
+ This value was calculated as max_temp - min_temp
* max_wet_bulb, num | Maximum wet-bulb (humidity) level reached on the observation date
* min_wet_bulb, num | Minimum wet-bulb (humidity) level reached on the observation date
* diff_wet_bulb, num | Difference in wet-bulb levels on the observation date
+ This value was calculated as max_wet_bulb - min_wet_bulb
* zone, chr | NYISO zone in which the observation was made, by name
+ This chr variable will need to be formatted as a factor variable

Verify Data Quality

Data quality can be measured in various ways. This project report looks at data quality in terms of accuracy, completeness (missing data), consistency (outliers), and uniqueness (duplicate data). The following subsections evaluate data quality in this fashion.
* Note: in terms of accuracy, it is assumed that the data utilized are highly accurate
+ The NYISO groups provide the most accurate data available with respect to electric load
+ The airports across the state of NY provide the most accurate weather data available

Explore Missing Data

After initial inspection of the load data that was downloaded for the intended time frame, there were no hourly observations that were missing.

After initial inspection of the weather data downloaded for the intended time frame, one date was found to have no recorded observations: 09/10/2017.
* To move past this, the weather values from the next date, 09/11/2017, are copied into the date of 09/10/2017.
* This step ensures that each date in the time frame has values present.
+ The statistical models described below rely on a complete time series in order to perform at their best
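
A dplyr/lubridate sketch of this fill step might look like the following (the exact code used for the fix is not shown in the report, so this is an assumed implementation using the wthr.init column names):

```r
# Copy the 09/11/2017 weather rows, relabel them as 09/10/2017,
# and append them to the weather data frame (assumed implementation)
fill.rows <- wthr.init %>%
  filter(date(date) == as.Date("2017-09-11")) %>%
  mutate(date = date - days(1))
wthr.init <- bind_rows(wthr.init, fill.rows) %>% arrange(date, zone)
```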

Explore Duplicate Data

After inspection of the initial load and weather data frames, there were no duplicated data values.
* This was determined because there was a single observation for each unit of time intended to be used by the analyses
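
One simple way to verify this (a sketch; the original check is not shown in the report) is to count observations per time unit and zone and look for counts above one:

```r
# Any returned rows would indicate duplicated hourly observations
load.init %>%
  count(date_hour, zone) %>%
  filter(n > 1)   # an empty result confirms uniqueness
```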

Explore for Outlier Data

Outliers are not removed in this report, as every load value is important to NYISO's analyses. All loads must be served, including the high and low points.

Note on Determinations Made from Data Quality Exploration

The explorations noted directly above were an attempt to address data quality concerns. However, not all of these explorations were intended to correct mistakes in the data. The exploration of outlier data was not intended to correct a mistake in the raw data frames downloaded. The missing data points that were found, however, are considered a mistake, as theoretically all data points should be available. Likewise, if duplicate data points had been found, that would also be a mistake, though the fault would lie with the analyst for not preprocessing the data correctly.


Data Preparation

In this step of the project report, the raw data frames are prepared to be better suited for analysis. Not all attributes in the initial data frames are necessary to analyze, and there are some deficiencies in the way the data were loaded (as described in the Data Understanding section) that need to be fixed. A final set of data frames will also need to be made in order to align the data with the programming performed.
* Multiple data frames will be created:
+ To transform various parts of each raw data frame
+ To create data frames for different analytic approaches

Select Data

The variables that are not important to the analysis were outlined in the Data Understanding section. Rather than removing them from the raw load.init and wthr.init data frames here, they are handled in the Construct Data section through the use of dplyr chains.

Clean Data

This section of the project report ensures the data frames are cleaned and optimized. The initial part of the cleaning phase is to remove white space and inconsistent letter casing amongst the character values of the data frames used to run the analyses.
* Initial data frame names are based on the time interval nature of the raw data.

# Load data - Hourly data frame creation
load.hour <- load.init %>% mutate_if(is_character, ~str_to_upper(.) %>% str_trim())

# Weather data - Daily data frame creation
wthr.day <- wthr.init %>% mutate_if(is_character, ~str_to_upper(.) %>% str_trim())

Identified deficiencies in the variable types are now handled.

# Load data frame manipulation
load.hour$gmt_hour <- as.factor(load.hour$gmt_hour)
load.hour$day <- as.factor(load.hour$day)
load.hour$zone <- as.factor(load.hour$zone)
load.hour$zone_id <- as.factor(load.hour$zone_id)

# Weather data frame manipulation
wthr.day$station_id <- as.factor(wthr.day$station_id)
wthr.day$zone <- as.factor(wthr.day$zone)

Construct Data

This section of the project report will construct and manipulate various data frames that will be used in analyses to follow.

There are multiple weather stations per NYISO zone in the weather data. Each weather station is located inside of an NYISO zone. The weather station data values are aggregated based on the NYISO zone.

# Aggregate weather data by NYISO zone
wthr.day <- wthr.day %>%
  mutate(date = (date(date))) %>%
  group_by(date, zone) %>%
  summarize(max_temp = mean(max_temp), min_temp = mean(min_temp),
            temp_diff = mean(temp_diff), max_wet_bulb = mean(max_wet_bulb),
            min_wet_bulb = mean(min_wet_bulb), diff_wet_bulb = mean(diff_wet_bulb))

Since the load.hour and wthr.day are divided up by different time units (hours vs days), hourly electric load data must be aggregated into daily electric load data.

# Load data - Daily data frame creation
load.day <- load.hour %>%
  mutate(date = (date(date_hour))) %>%
  group_by(date, zone) %>%
  summarize(day = as.factor(modeUDF(day)), load_actual = sum(load_actual), load_forecast = sum(load_forecast))

After creating load.day, it can be observed that wthr.day has more days contained in it than load.day does. In order to align these data frames, the time frames must be equal.  * Remove some of the values in the wthr.day data frame 

# Weather data - Slice daily frame to match time frame of daily load
wthr.dayslice <- wthr.day[12:15576,] # 1/1/15 - 11/15/18
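
Row-index slicing is brittle if the upstream download ever changes; an equivalent date-based selection (an alternative sketch, not the code used above) would be:

```r
# Select the same window by date instead of by row position
wthr.dayslice <- wthr.day %>%
  filter(date >= as.Date("2015-01-01"), date <= as.Date("2018-11-15"))
```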

It would also be helpful to have an aggregated daily count of the load and weather values, across all zones.  * This would represent NY-state wide data 

# Load data - Aggregate daily frame creation for NYS
load.dayagg <- load.day %>%
  group_by(date) %>%
  summarize(day = as.factor(modeUDF(day)), load_actual = sum(load_actual), load_forecast = sum(load_forecast))

# Weather data - Aggregate daily frame creation for NYS
wthr.dayagg <- wthr.day %>%
  group_by(date) %>%
  summarize(max_temp = mean(max_temp), min_temp = mean(min_temp),
            max_wet_bulb = mean(max_wet_bulb), min_wet_bulb = mean(min_wet_bulb)) %>%
  slice(2:1416) # 1/1/15 - 11/15/18

Integrate Data

Now that both the aggregated and daily zonal weather and load data frames are within the same time frame, merge them to make analysis simpler.
* A new data frame will be created for the daily zonal data
* A new data frame will be created for the aggregated data

# Merge both data frames
df.day <- merge(x = load.day, y = wthr.dayslice, by = c("date", "zone"), all = TRUE)
df.dayagg <- merge(x = load.dayagg, y = wthr.dayagg, by = c("date"), all = TRUE)

Format Data

It can be observed in df.day and df.dayagg that the date attributes are in Date format. In order to better pursue time series analyses, these data frames are copied into new data frames so the date attributes can be changed to POSIXct format.

# Create data frames more suitable for time series
ts.day <- df.day
ts.day$date <- as.POSIXct(ts.day$date)
ts.dayagg <- df.dayagg
ts.dayagg$date <- as.POSIXct(ts.dayagg$date)

# need to make new slice data frames to use with linear models utilizing diffs
df.dayagg.slice <- df.dayagg %>% slice(1:1415) # 1/1/15 - 11/15/18
ts.dayagg.slice <- ts.dayagg %>% slice(2:1416) # 1/1/15 - 11/14/18

Explore Data

After data has been cleaned, merged, and formatted, it can now be explored. Basic statistical figures were generated to gain a sense of what the data frames look like. Gaining exploratory knowledge around a data frame aids in data understanding efforts, as well as makes the analyst better equipped to perform model generation.

In order to run some statistical tests, it would be helpful to remove all data variables that are non-numeric.

# Remove all data attributes that are non-numeric from ts.dayagg
num.dayagg <- ts.dayagg %>% select(-c(date, day, load_forecast))

The next set of code will provide low level statistics around the num.dayagg data frame.

describe(num.dayagg) %>% kable(digits = 2, booktabs = T) %>%
  kable_styling(latex_options = c("striped"))
variable       vars  n     mean          sd           median        trimmed       mad          min          max          range         skew        kurtosis    se
load_actual    1     1415  438412.80007  59468.07435  424239.30000  432019.41712  52455.12930  337975.1000  650913.4000  312938.3000   0.9709803   0.6478428   1580.9045732
max_temp       2     1415  59.70616      19.46218     61.70455      60.69011      24.75493     5.628788     93.28788     87.65909      -0.3378063  -0.9637889  0.5173844
min_temp       3     1415  42.83554      18.11195     43.98485      43.85496      21.74480     -10.136364   73.77273     83.90909      -0.3843855  -0.6683433  0.4814898
max_wet_bulb   4     1415  51.37014      16.46299     53.18182      52.28789      20.21727     3.287879     78.97727     75.68939      -0.3881000  -0.8373158  0.4376537
min_wet_bulb   5     1415  40.14868      17.84745     41.04545      41.06156      21.09335     -11.227273   71.31818     82.54545      -0.3453427  -0.7207986  0.4744582

The next set of code analyzes the correlation matrix of the numerical variables.

cor.dayagg <- cor(num.dayagg)
corrplot.mixed(cor.dayagg)

General relationship exploration

Visualization of the most important variables.

ggplot(data = (ts.dayagg),
       aes(x = date, y = load_actual, color = day)) +
  geom_jitter(width = 0.15) +
  guides(fill = TRUE) +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
    ggtitle(label = "Aggregated Daily NYISO Load Amounts Over Time",
          subtitle = "Color Encodes Various Days") +
       xlab("Date") + ylab("Actual Load (in MegaWatts)")

From this graph, it can be observed that the load values experienced in summer and winter months are typically higher than those values experienced in spring and fall months.

ggplot(data = (ts.dayagg),
       aes(x = date, y = max_temp)) +
  geom_jitter(width = 0.15) +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
    ggtitle(label = "Daily Max Temperature Values Over Time",
            subtitle = "Max Temperature Values are represented by Means") +
       xlab("Date") + ylab("Max Temperature")

ggplot(data = (ts.dayagg),
       aes(x = date, y = min_temp)) +
  geom_jitter(width = 0.15) +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
    ggtitle(label = "Daily Min Temperature Values Over Time",
            subtitle = "Min Temperature Values are represented by Means") +
       xlab("Date") + ylab("Min Temperature")

It can be observed from the above two graphs that the maximum and minimum values for max_temp and min_temp are following a consistent pattern over time.

ggplot(data = ts.day,
       aes(x = max_temp, y = load_actual, color = zone)) +
  geom_jitter(width = 0.15) +
  guides(fill = TRUE) +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
  ggtitle(label = "Relationship Between Max Temp and Actual Load",
          subtitle = "Color Encodes Various Zones") +
  xlab("Maximum Temperature (in Degrees Fahrenheit)") + ylab("Actual Load (in MegaWatts)")

From this graph, it can be observed that the N.Y.C. zone uses the most electricity across the NYISO zones. Also, once the temperature climbs above roughly 70 degrees, the load increases more rapidly than it does below 70 degrees.

ggplot(data = ts.dayagg,
       aes(x = max_temp, y = load_actual)) +
  geom_jitter(width = 0.15) +
  guides(fill = TRUE) +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
  ggtitle(label = "Relationship Between Max Temp and Actual Load",
          subtitle = "Aggregated Across All NYISO Zones") +
  xlab("Maximum Temperature (in Degrees Fahrenheit)") + ylab("Actual Load (in MegaWatts)")

ggplot((df.dayagg), aes(x = day, y = load_actual)) +
  geom_boxplot() + 
  geom_dotplot(binaxis='y', stackdir='center', 
               dotsize = .40, fill="green") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  ggtitle(label = "Relationship Between Day of Week and Actual Load") +
  xlab("Day of Week") + ylab("Actual Load (in MegaWatts)")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

From the graph above, it can be observed that Saturday and Sunday experience lower than average load values. The weekdays have load values similar to one another, and higher than those of the weekends.

ggplot((df.dayagg), aes(x = day, y = max_temp)) +
  geom_boxplot() + 
  geom_dotplot(binaxis='y', stackdir='center', 
               dotsize = .5, fill="red") +
  theme(axis.text.x = element_text(angle=65, vjust=0.6)) +
  ggtitle(label = "Relationship Between Day of Week and Max Temp") +
  xlab("Day of Week") + ylab("Max Temp")
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.


Modeling and Modeling Evaluation

In this report, a linear regression model will be built, as well as an ARIMA model. Each of these models will be utilized to differently predict future electric load values.

Linear Regression Model

Build a linear regression model based on standardized inputs

# Build a time series linear regression model
tslm1 <- lm(formula = load_actual ~ date + day + max_temp +
            min_temp + max_wet_bulb + min_wet_bulb,
            data = ts.dayagg)

# Present a nice looking table of tslm1 results
tidy(tslm1) %>% kable(digits = 2, booktabs = T) %>%
  kable_styling(latex_options = c("striped", "hold_position"))
term          estimate         std.error      statistic   p.value
(Intercept)   637552.5523642   61817.1186285  10.3135275  0.0000000
date          -0.0001493       0.0000414      -3.6067310  0.0003210
dayMON        -8.8103836       5247.2839446   -0.0016790  0.9986606
daySAT        -31194.4574959   5241.9426871   -5.9509345  0.0000000
daySUN        -39164.6221716   5249.8660754   -7.4601183  0.0000000
dayTHUR       1998.7313605     5236.2842640   0.3817080   0.7027358
dayTUES       4833.5800230     5247.5413984   0.9211133   0.3571496
dayWED        3000.6115562     5248.8861096   0.5716663   0.5676396
max_temp      -86.8491201      379.5166625    -0.2288414  0.8190255
min_temp      -4060.3108104    1176.3015245   -3.4517602  0.0005735
max_wet_bulb  -485.9084281     589.2538855    -0.8246164  0.4097296
min_wet_bulb  5844.6789449     1197.1753419   4.8820576   0.0000012

The beta values and p-values can be seen in the kable table above. Since date is not a reasonable value to regress upon, the linear model needs to be altered slightly in order to differently account for seasonality.

In order to account for seasonality, this report will take a naive approach at encoding seasonality through an average load variable.

# Create a Day of Year column
df.dayagg.doy <- df.dayagg %>% mutate(doy = yday(date)) 

# Create a dataframe with average load values by day of year
df.doyagg = df.dayagg.doy %>% group_by(doy) %>% summarise(avg_load = mean(load_actual))

# Join the averaged day of year value to each data observation
df.dayagg.new = left_join(df.dayagg.doy, df.doyagg, by="doy")

Now that there is a data frame with some seasonality encoded, try to run a different linear regression model.

# Build a time series linear regression model
tslm2 <- lm(formula = load_actual ~ day + doy + max_temp +
            min_temp + max_wet_bulb + min_wet_bulb + avg_load,
            data = df.dayagg.new)

# Present a nice looking table of tslm2 results
tidy(tslm2) %>% kable(digits = 2, booktabs = T) %>%
  kable_styling(latex_options = c("striped", "hold_position"))
term          estimate        std.error    statistic   p.value
(Intercept)   13923.8615559   9930.836119  1.4020835   0.1611117
dayMON        1387.3851783    3226.140852  0.4300448   0.6672291
daySAT        -25633.6702719  3224.753376  -7.9490328  0.0000000
daySUN        -31027.5198844  3231.991085  -9.6001255  0.0000000
dayTHUR       6014.8718279    3220.375269  1.8677549   0.0620047
dayTUES       8102.5450942    3226.971987  2.5108818   0.0121548
dayWED        10504.3754044   3230.890431  3.2512323   0.0011764
doy           -3.7668430      9.296854     -0.4051740  0.6854115
max_temp      -103.0804609    232.770362   -0.4428419  0.6579484
min_temp      -199.1262290    717.608520   -0.2774859  0.7814480
max_wet_bulb  42.4003157      362.005215   0.1171263   0.9067768
min_wet_bulb  318.7559864     734.622503   0.4339045   0.6644246
avg_load      0.9790755       0.020566     47.6066240  0.0000000

Based on the model above, the p-values observed for the max_temp, min_temp, max_wet_bulb, and min_wet_bulb variables are high. Since business understanding indicates that these variables are important, they are left in the linear regression model even though they fail a 95% confidence threshold.
* Note: Given more time to analyze the report findings, the linear models would be altered until more appropriate p-values were returned

ggplot(tslm2$model, aes_string(x = names(tslm2$model)[4], y = names(tslm2$model)[1])) + 
  geom_point() +
  stat_smooth(aes(group = 1), method = 'lm', formula = y ~ poly(x, 2, raw=TRUE)) +
  ggtitle(label = "Relationship Between Max Temp and Actual Load with TSLM2",
          subtitle = paste("Adj R2 = ",signif(summary(tslm2)$adj.r.squared, 5),
                     "Intercept =",signif(tslm2$coef[[1]],5 ),
                     " Slope =",signif(tslm2$coef[[2]], 5),
                     " P =",signif(summary(tslm2)$coef[2,4], 5))) +
  xlab("Maximum Temperature (in Degrees Fahrenheit)") + ylab("Actual Load (in MegaWatts)")

Based on the plot above, it can be seen that tslm2 closely follows the relationship between actual load and maximum temperature.

checkresiduals(tslm2)

## 
##  Breusch-Godfrey test for serial correlation of order up to 16
## 
## data:  Residuals
## LM test = 1007.4, df = 16, p-value < 2.2e-16

Based on the residuals check on tslm2, the residuals are approximately normally distributed, and no obvious pattern is visible in the residual plot.
* Note: Given more time to analyze the data, the spikes that can be seen around the summer months would be explored further.

ARIMA Model

ARIMA Model - CAPITL

As a code development aid, this project first looks at only a small subset of the data (a single zone of the NYISO service area) to refine the modeling process.
* Note: The report will later go back and apply the refined process to the whole dataset; 'CAPITL' is targeted for now.

# Filter out daily load data for 'CAPITL' zone
capitl.daily <- load.day %>% filter(zone == 'CAPITL')

# Convert capitl.daily to time series object
capitl.ts <- ts(capitl.daily$load_actual, 
              start = c(2015, 1), 
              frequency = 365.25)

The next step is to examine a few individual years for seasonality.

par(mfrow=c(1,3))
# Check for seasonality, plotting data for single 2016 year
capitl.ts %>% window(end = 2016) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2016")

# Check for seasonality, plotting data for single 2017 year
capitl.ts %>% window(end = 2017) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2017")

# Check for seasonality, plotting data for single 2018 year
capitl.ts %>% window(end = 2018) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2018")

The data are clearly non-stationary, with some seasonality. The project must first take a seasonal difference of the load data. The seasonally differenced data are shown in the figure below.

# Displaying the plots resulting from taking seasonal difference
capitl.ts %>% diff(lag=365) %>% ggtsdisplay()

These plots still appear non-stationary, so the report takes an additional first difference, shown in the figure below. After this second difference the data appear stationary.  * A total of two differences can therefore be used in the ARIMA model 

# Adding a second difference
capitl.ts %>% diff(lag=365)%>% diff() %>% ggtsdisplay()
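The effect of the seasonal-then-first differencing applied above can be illustrated on a toy series (a made-up quarterly series with a linear trend, not project data):

```r
# Toy quarterly series: a repeating seasonal pattern plus a +1-per-period trend
x <- ts(rep(c(10, 20, 30, 40), 5) + 1:20, frequency = 4)

# The seasonal difference removes the repeating pattern, leaving only the
# trend increment (a constant 4, i.e. one year's worth of +1 steps)
d1 <- diff(x, lag = 4)

# A further first difference removes the trend, leaving a stationary
# (here, identically zero) series
d2 <- diff(d1)
```

The load data behave analogously, with lag = 365 in place of lag = 4.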

Several ARIMA models are now fitted and evaluated on their AIC values and ACF plots.  * A low AIC value and an ACF plot with no significant remaining lags are preferred. 

# This fit gives good ACF plot
## Second lowest aic value and good p value
fit0 <- arima(capitl.ts, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit0
## 
## Call:
## arima(x = capitl.ts, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.3151  0.5786  -0.9899
## s.e.   0.0803  0.0691   0.0112
## 
## sigma^2 estimated as 2771997:  log likelihood = -12446.57,  aic = 24901.15
# Ljung-Box test + plot of fit0
checkresiduals(fit0)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(0,1,1)[7]
## Q* = 835.71, df = 727.5, p-value = 0.003221
## 
## Model df: 3.   Total lags used: 730.5
## aic = 24901.15 p-value = 0.003436
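The Ljung-Box statistic reported by checkresiduals() can also be computed directly with base R's stats::Box.test. A minimal sketch on simulated white noise (not the actual model residuals):

```r
set.seed(42)
wn <- rnorm(200)  # simulated white-noise "residuals"

# type = "Ljung-Box" matches the test used by checkresiduals();
# fitdf subtracts the number of fitted model parameters from the df
lb <- Box.test(wn, lag = 10, type = "Ljung-Box", fitdf = 0)
lb$p.value  # a large p-value indicates no evidence of serial correlation
```

For a fitted ARIMA model, wn would be replaced by residuals(fit0) and fitdf by the number of estimated coefficients.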

# This fit gives good ACF plot
## Third lowest aic value and good p value 
fit1 <- arima(capitl.ts, order = c(0,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit1
## 
## Call:
## arima(x = capitl.ts, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.2941  -0.9898
## s.e.  0.0283   0.0109
## 
## sigma^2 estimated as 2802720:  log likelihood = -12454.31,  aic = 24914.62
# Ljung-Box test + plot of fit1
checkresiduals(fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,1)[7]
## Q* = 844.12, df = 728.5, p-value = 0.001869
## 
## Model df: 2.   Total lags used: 730.5
## aic = 24914.62 p-value = 0.002342

# This fit gives bad ACF plot
fit2 <- arima(capitl.ts, order = c(0,0,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit2
## 
## Call:
## arima(x = capitl.ts, order = c(0, 0, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.7643  -0.5269
## s.e.  0.0131   0.0221
## 
## sigma^2 estimated as 4527319:  log likelihood = -12788.62,  aic = 25583.23
# Ljung-Box test + plot of fit2
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 6675, df = 728.5, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 730.5
## aic = 25583.23 p-value < 2.2e-16

# This fit gives good ACF plot
## Lowest aic value; however, the p-value is a little high - move to the next lowest aic value
fit3 <- arima(capitl.ts, order = c(1,1,1), seasonal = list(order = c(1, 1, 1), period = 7))
fit3
## 
## Call:
## arima(x = capitl.ts, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), 
##     period = 7))
## 
## Coefficients:
##           ar1     ma1    sar1    sma1
##       -0.2854  0.5512  0.0902  -1.000
## s.e.   0.0749  0.0646  0.0268   0.029
## 
## sigma^2 estimated as 2733671:  log likelihood = -12441.12,  aic = 24892.23
# Ljung-Box test + plot of fit3
checkresiduals(fit3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[7]
## Q* = 793.31, df = 726.5, p-value = 0.04281
## 
## Model df: 4.   Total lags used: 730.5
## aic = 24892.23 p-value = 0.05198

# Evaluating the aic values of each fitted model to decide which one to pursue
aics <- c('fit0' = fit0$aic, 'fit1' = fit1$aic,'fit2' = fit2$aic, 'fit3' =fit3$aic)
aics
##     fit0     fit1     fit2     fit3 
## 24901.15 24914.62 25583.23 24892.23

The AIC values are printed in the table above.

Analysis shows fit3 gives the lowest AIC value for capitl.ts.  * However, its p-value is a little high, so the analysis moves to the next lowest AIC value.  fit0 gives a good ACF plot.  * It has the second-lowest AIC value and a good p-value as well.  Therefore, fit0 is chosen as the model for forecasting capitl.ts.
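As a sanity check on the values in the table, AIC can be reproduced by hand from the log likelihood via AIC = 2k - 2·logLik, where k counts the estimated parameters (for fit0, the three ARMA coefficients plus the innovation variance). A sketch using fit0's reported numbers:

```r
# Reproduce fit0's AIC from its reported log likelihood
loglik <- -12446.57   # log likelihood reported for fit0
k      <- 4           # ar1, ma1, sma1 coefficients + sigma^2
aic    <- 2 * k - 2 * loglik
aic    # matches the reported 24901.15 (to rounding)
```

This is why models with more coefficients (like fit3) must buy a noticeably better likelihood to earn a lower AIC.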

ARIMA Model - N.Y.C.

Now the report builds an ARIMA model for the NYISO zone with the highest daily electricity consumption, ‘N.Y.C.’. 

# Filter out daily load data for 'N.Y.C.' zone
nyc.daily <- load.day %>% filter(zone == 'N.Y.C.')

# Convert nyc.daily to time series object
nyc.ts <- ts(nyc.daily$load_actual, 
              start = c(2015, 1), 
              frequency = 365.25)

The next step is to examine a few individual years for seasonality.

par(mfrow=c(1,3))
# Check for seasonality, plotting data for single 2016 year
nyc.ts %>% window(end = 2016) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2016")

# Check for seasonality, plotting data for single 2017 year
nyc.ts %>% window(end = 2017) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2017")

# Check for seasonality, plotting data for single 2018 year
nyc.ts %>% window(end = 2018) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2018")

The data are clearly non-stationary, with some seasonality. The project must first take a seasonal difference of the load data. The seasonally differenced data are shown in the figure below.

# Displaying the plots resulting from taking seasonal difference
nyc.ts  %>% diff(lag=365) %>% ggtsdisplay()

These plots still appear non-stationary, so the report takes an additional first difference, shown in the figure below. After this second difference the data appear stationary.  * A total of two differences can therefore be used in the ARIMA model 

# Adding a second difference
nyc.ts %>% diff(lag=365)%>% diff() %>% ggtsdisplay()

Several ARIMA models are now fitted and evaluated on their AIC values and ACF plots.  * A low AIC value and an ACF plot with no significant remaining lags are preferred. 

fit0 <- arima(nyc.ts, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit0
## 
## Call:
## arima(x = nyc.ts, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), period = 7))
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.2584  0.4981  -0.9999
## s.e.   0.0825  0.0731   0.0203
## 
## sigma^2 estimated as 71898301:  log likelihood = -14741.87,  aic = 29491.74
checkresiduals(fit0)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(0,1,1)[7]
## Q* = 939.1, df = 727.5, p-value = 0.0000001673
## 
## Model df: 3.   Total lags used: 730.5
## aic = 29491.74 p-value = 0.004033

fit1 <- arima(nyc.ts, order = c(0,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit1
## 
## Call:
## arima(x = nyc.ts, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.2624  -1.0000
## s.e.  0.0281   0.0184
## 
## sigma^2 estimated as 72405026:  log likelihood = -14746.85,  aic = 29499.71
checkresiduals(fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,1)[7]
## Q* = 952.27, df = 728.5, p-value = 0.00000003979
## 
## Model df: 2.   Total lags used: 730.5
## aic = 29499.71 p-value = 0.004097

fit2 <- arima(nyc.ts, order = c(0,0,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit2
## 
## Call:
## arima(x = nyc.ts, order = c(0, 0, 1), seasonal = list(order = c(0, 1, 1), period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.7375  -0.5356
## s.e.  0.0134   0.0201
## 
## sigma^2 estimated as 118872734:  log likelihood = -15089.25,  aic = 30184.5
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 6094.1, df = 728.5, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 730.5
## aic = 30184.5 p-value <  2.2e-16

fit3 <- arima(nyc.ts, order = c(1,1,1), seasonal = list(order = c(1, 1, 1), period = 7))
fit3
## 
## Call:
## arima(x = nyc.ts, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), period = 7))
## 
## Coefficients:
##           ar1     ma1    sar1     sma1
##       -0.2513  0.4911  0.0312  -1.0000
## s.e.   0.0811  0.0720  0.0268   0.0139
## 
## sigma^2 estimated as 71848195:  log likelihood = -14741.19,  aic = 29492.38
checkresiduals(fit3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[7]
## Q* = 927.81, df = 726.5, p-value = 0.0000005392
## 
## Model df: 4.   Total lags used: 730.5
## aic = 29492.38 p-value = 0.01221

# Evaluating the aic values of each fitted model to decide which one to pursue
aics <- c('fit0' = fit0$aic, 'fit1' = fit1$aic,'fit2' = fit2$aic, 'fit3' =fit3$aic)
aics
##     fit0     fit1     fit2     fit3 
## 29491.74 29499.71 30184.50 29492.38

The AIC values are printed in the table above.

Analysis shows fit0 gives the lowest AIC value for nyc.ts.  * fit0 is chosen because it has the lowest AIC.  + This may help in drawing conclusions about a particular fitted model.

ARIMA Model - NYS

Now the report builds an ARIMA model for load aggregated across all eleven NYISO zones.

# Convert load.dayagg$load_actual to time series object
nyiso.ts <- ts(load.dayagg$load_actual, 
              start = c(2015, 1), 
              frequency = 365.25)

The next step is to examine a few individual years for seasonality.

par(mfrow=c(1,3))
# Check for seasonality, plotting data for single 2016 year
nyiso.ts %>% window(end = 2016) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2016")

# Check for seasonality, plotting data for single 2017 year
nyiso.ts %>% window(end = 2017) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2017")

# Check for seasonality, plotting data for single 2018 year
nyiso.ts %>% window(end = 2018) %>%
  autoplot() + ylab("Actual Load (in MegaWatts)") + xlab("Year") +
  ggtitle("Load variations in 2018")

The data are clearly non-stationary, with some seasonality. The project must first take a seasonal difference of the load data. The seasonally differenced data are shown in the figure below.

# Displaying the plots resulting from taking seasonal difference
nyiso.ts  %>% diff(lag=365) %>% ggtsdisplay()

These plots still appear non-stationary, so the report takes an additional first difference, shown in the figure below. After this second difference the data appear stationary.  * A total of two differences can therefore be used in the ARIMA model 

# Adding a second difference
nyiso.ts %>% diff(lag=365)%>% diff() %>% ggtsdisplay()

Several ARIMA models are now fitted and evaluated on their AIC values and ACF plots.  * A low AIC value and an ACF plot with no significant remaining lags are preferred. 

fit0 <- arima(nyiso.ts, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit0
## 
## Call:
## arima(x = nyiso.ts, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##           ar1     ma1     sma1
##       -0.2133  0.5241  -0.9907
## s.e.   0.0691  0.0595   0.0112
## 
## sigma^2 estimated as 446110447:  log likelihood = -16021.35,  aic = 32050.7
checkresiduals(fit0)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(0,1,1)[7]
## Q* = 903.86, df = 727.5, p-value = 0.000008073
## 
## Model df: 3.   Total lags used: 730.5
# aic = 32050.7 p-value = 0.02722

fit1 <- arima(nyiso.ts, order = c(0,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit1
## 
## Call:
## arima(x = nyiso.ts, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.3408  -0.9910
## s.e.  0.0270   0.0111
## 
## sigma^2 estimated as 449164761:  log likelihood = -16026.25,  aic = 32058.5
checkresiduals(fit1)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,1)[7]
## Q* = 909.07, df = 728.5, p-value = 0.000005284
## 
## Model df: 2.   Total lags used: 730.5
# aic = 32058.5 p-value = 0.03592

fit2 <- arima(nyiso.ts, order = c(0,0,1), seasonal = list(order = c(0, 1, 1), period = 7))
fit2
## 
## Call:
## arima(x = nyiso.ts, order = c(0, 0, 1), seasonal = list(order = c(0, 1, 1), 
##     period = 7))
## 
## Coefficients:
##          ma1     sma1
##       0.7738  -0.5244
## s.e.  0.0126   0.0207
## 
## sigma^2 estimated as 775231003:  log likelihood = -16409.31,  aic = 32824.62
checkresiduals(fit2)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,1)(0,1,1)[7]
## Q* = 7573, df = 728.5, p-value < 2.2e-16
## 
## Model df: 2.   Total lags used: 730.5
# aic = 32824.62 p-value < 2.2e-16

fit3 <- arima(nyiso.ts, order = c(1,1,1), seasonal = list(order = c(1, 1, 1), period = 7))
fit3
## 
## Call:
## arima(x = nyiso.ts, order = c(1, 1, 1), seasonal = list(order = c(1, 1, 1), 
##     period = 7))
## 
## Coefficients:
##           ar1     ma1    sar1     sma1
##       -0.2051  0.5169  0.0431  -0.9948
## s.e.   0.0676  0.0582  0.0273   0.0155
## 
## sigma^2 estimated as 444482133:  log likelihood = -16020.1,  aic = 32050.21
checkresiduals(fit3)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)(1,1,1)[7]
## Q* = 893.59, df = 726.5, p-value = 0.00002035
## 
## Model df: 4.   Total lags used: 730.5
# aic =  32050.21 p-value = 0.06666

# Evaluating the aic values of each fitted model to decide which one to pursue
aics <- c('fit0' = fit0$aic, 'fit1' = fit1$aic,'fit2' = fit2$aic, 'fit3' =fit3$aic)
aics
##     fit0     fit1     fit2     fit3 
## 32050.70 32058.50 32824.62 32050.21

The AIC values are printed in the table above.

Analysis shows fit0 and fit3 give the lowest AIC values for nyiso.ts.  * fit0 is chosen for consistency with the other zones.  + This may help in drawing conclusions about a particular fitted model. 

Deployment

Deployment takes the form of forecasting. The linear regression and ARIMA models previously described are used in forecasting scenarios. After the forecasts are made, the MAPE and RMSE are recalculated to evaluate the accuracy of the forecasted models.
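The two accuracy measures can be written out directly; a minimal base-R sketch of the same formulas that accuracy() reports:

```r
# Mean Absolute Percentage Error, in percent
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

# Root Mean Squared Error, in the units of the series (here, MegaWatts)
rmse <- function(actual, forecast) {
  sqrt(mean((actual - forecast)^2))
}

mape(c(100, 200), c(110, 190))  # 7.5
rmse(c(100, 200), c(110, 190))  # 10
```

MAPE is scale-free, which is why it is used below to compare zones with very different load levels, while RMSE stays in MegaWatts.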

Linear Regression Forecasting

A test data frame is created for the purpose of running linear forecasting.

# Create a data frame with the testing time frame
df.dayagg.test <- df.dayagg.new %>% slice(1097:1415) # 12/31/17 - 11/15/18, 319 obs

The best linear regression model is run in the forecast.

# Forecasting tslm2
lm.forecast <- forecast(tslm2, newdata = df.dayagg.test, h = 1)
lm.fit.err <- accuracy(df.dayagg.new$load_actual, lm.forecast$fitted[1097:1415])
lm.fit.err
##                 ME     RMSE      MAE      MPE     MAPE
## Test set -7257.332 43679.29 35251.03 -1.93425 8.018138
#subsetting data points for 2018 from the full data set
nyiso.2018 <- load.dayagg[c(1097:1415),]
nyiso.err <- accuracy(nyiso.2018$load_actual,nyiso.2018$load_forecast)
nyiso.err
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -14513.87 17899.49 15694.34 -3.446229 3.690309

The MAPE produced by the linear forecasting over the testing timeframe is much higher than the MAPE of NYISO's own forecast (8.02 vs. 3.69). * Note: If additional time allowed, tslm2 would be revisited and optimized to produce a smaller forecasted MAPE value.

ARIMA Forecasting

Each ARIMA fit will be forecasted.

# Initialize a train data set (Jan 1st 2015 to Dec 31 2017) 
train <- window(capitl.ts,end=2018)

# Initialize a test data set (Jan 1st 2018 to Nov 15th 2018)
test <- window(capitl.ts,start=2018)

# Initialize a numeric value to be used in looping
n <- 319

Forecasting with the CAPITL Fitted ARIMA Model

# Initializing vector that will be used to store the forecasted values in loop
fcmat <- rep(0,n)

# Will be used inside the loop as a train data set
loopdat <- train

# loop for rolling forecast
for(i in 1:n)
{ 
loopdat <- capitl.ts[1:(1096+i)] # Jan 1st 2015 to Dec 31 2017 gives 1096 points; the window grows by one day each iteration
refit <- arima(loopdat, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fcmat[i] <- forecast(refit, h=1)$mean
}
## Warning in log(s2): NaNs produced
# Assemble actual vs. forecasted values for the full 319-day test window
newdat <- data.frame(actual = capitl.ts[1097:1415], forecasted = fcmat[1:n])

#checking accuracy/ error measures of our model 
capitl.fitted.err <- accuracy(newdat$actual, newdat$forecasted)
capitl.fitted.err
##                ME     RMSE      MAE       MPE     MAPE
## Test set -3.74839 1220.945 875.1805 -0.107263 2.629809
#subsetting data points for 2018 from the full data set
capitl.2018 <- capitl.daily[c(1097:1415),]
#checking accuracy of NYISO's current forecasted practice for capitl zone
capitl.err <- accuracy(capitl.2018$load_actual,capitl.2018$load_forecast)
capitl.err
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -1287.469 1660.987 1422.245 -4.107479 4.470823

The fitted MAPE (2.63) calculated from the forecasted values is much smaller than NYISO’s current forecasted values’ MAPE (4.47).  * This is a good sign. 

Forecasting with the NYC Fitted ARIMA Model

# Initializing vector that will be used to store the forecasted values in loop
fcmat <- rep(0,n)

# Will be used inside the loop as a train data set
loopdat <- train

for(i in 1:n)
{ 
loopdat <-nyc.ts[1:(1096+i)]
refit <- arima(loopdat, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fcmat[i] <- forecast(refit, h=1)$mean
}

# Assemble actual vs. forecasted values for the test window
newdat <- data.frame(actual = nyc.ts[1097:1415], forecasted = fcmat[1:319])

#checking accuracy/ error measures of our model 
nyc.fitted.err <- accuracy(newdat$actual, newdat$forecasted)
nyc.fitted.err
##                ME     RMSE      MAE        MPE     MAPE
## Test set 2.762226 8289.385 5651.288 -0.1861356 3.930406
#subsetting data points for 2018 from the full data set
nyc.2018 <- nyc.daily[c(1097:1415),]
#checking accuracy of NYISO's current practice 
nyc.err <- accuracy(nyc.2018$load_actual,nyc.2018$load_forecast)
nyc.err
##                 ME     RMSE      MAE        MPE    MAPE
## Test set -1243.732 4971.582 3442.749 -0.8811916 2.25063

The fitted MAPE (3.93) calculated from the forecasted values is higher than NYISO’s current forecasted values’ MAPE (2.25).  * It is interesting that the same fitted model performs well for one zone and poorly for another. 

Forecasting with the NYS Fitted ARIMA Model

# Initializing vector that will be used to store the forecasted values in loop
fcmat <- rep(0,n)

# Will be used inside the loop as a train data set
loopdat <- train

for(i in 1:n)
{ 
loopdat <-nyiso.ts[1:(1096+i)]
refit <- arima(loopdat, order = c(1,1,1), seasonal = list(order = c(0, 1, 1), period = 7))
fcmat[i] <- forecast(refit, h=1)$mean
}

# Assemble actual vs. forecasted values for the test window
newdat <- data.frame(actual = nyiso.ts[1097:1415], forecasted = fcmat[1:319])

#checking accuracy/ error measures of our model 
nyiso.fitted.err <- accuracy(newdat$actual, newdat$forecasted)
nyiso.fitted.err
##                ME     RMSE      MAE        MPE     MAPE
## Test set -37.0658 20269.63 14101.66 -0.1446101 3.240485
#checking accuracy of NYISO's current practice 
nyiso.err
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -14513.87 17899.49 15694.34 -3.446229 3.690309

The fitted MAPE (3.24) calculated from the forecasted values is lower than NYISO’s current forecasted values’ MAPE (3.69).


Conclusion

In total, this report was a good exercise in developing and testing forecasting models.

From the analyses presented in this report, it can be concluded that the same fitted model is not good for all zones or every data set. This matches the conclusion stated by T. Hong (2015):  - There is no perfect model  - All forecasts can be improved  - Accuracy is never guaranteed 

In the future, additional forecasting models will be applied to the dataset to determine whether there are better modeling techniques for electric load forecasting.


References

In this section of the report, full references are provided to accompany the in-line citations used throughout the project above. Various supporting links have also been restated, to ensure there is a dedicated and accurate reference section.

The following references were made in line with text throughout the report.

[1] H. Hahn, S. Meyer-Nieberg, and S. Pickl, “Electric load forecasting methods: Tools for decision making,” Eur. J. Oper. Res., 2009. 
[2] E. A. Feinberg and D. Genethliou, “Load Forecasting,” in Applied Mathematics for Restructured Electric Power Systems. Power Electronics and Power Systems., J. H. Chow, F. F. Wu, and J. Momoh, Eds. Boston, MA: Springer US, 2005, pp. 269-285. 
[3] S. A. Soliman and A. M. Al-Kandari, “Mathematical Background and State of the Art,” in Electrical Load Forecasting: Modeling and Model Construction, Elsevier, 2010, pp. 1-44. 
[4] T. Hong, “Crystal Ball Lessons in Predictive Analytics,” Energybiz, 2015. 
[5] Y. Yang, S. Li, W. Li, and M. Qu, “Power load probability density forecasting using Gaussian process quantile regression,” Appl. Energy, vol. 213, pp. 499-509, 2018. 
[6] T. Hong, “Short term electric load forecasting,” North Carolina State University, 2010. 
[7] T. Hong and D. A. Dickey, Electric Load Forecasting: Fundamentals and Best Practices. OTexts, 2015, 306 pp.